Method of operating a speech recognition system

ABSTRACT

In methods of operating a speech recognition system, a speech signal from a user is analyzed for recognizing speech information contained in the speech signal. When situated in an active receive mode, an acknowledgement of receive activity is produced in response to an inquiry about the receive activity from a user. In another embodiment, before speech data including at least a portion of the speech signal and/or at least a portion of the speech information are transmitted from an internal user-controlled area into an external area, the respective speech data are filtered and/or a message is sent to the user that a transmission of the speech data to the external area is imminent.

The invention relates to methods of operating a speech recognition system, in which methods a speech signal from a user is analyzed for recognizing speech information contained in the speech signal. In addition, the invention relates to associated speech recognition systems for implementing the methods.

As speech recognition systems become more efficient, they are increasingly used in a wide variety of applications. For example, there are already dictation systems that operate fairly satisfactorily, in which a speech recognition system implemented on a PC captures the user's continuous speech, recognizes it and writes it to a text data file which can then be further processed by the user via a customary text processing program. Furthermore, various technical devices controlled by speech recognition systems have existed for some time. These devices, however, have only a very limited vocabulary of command words which are necessary for controlling the device. Such speech controls have many advantages. A major advantage is that a speech control can be operated hands-free and is therefore particularly useful in situations in which the user needs his hands for other purposes. This applies, for example, to the control of peripheral devices such as music systems, mobile radio sets or navigation systems in motor vehicles, which are therefore increasingly offered with speech controls. Speech recognition systems or speech controls, respectively, are also extremely helpful for persons who have considerable movement impediments and therefore have speech as their only means of communication and control. In addition, speech controls are advantageous in general because they form an interface between man and machine that is adapted to the natural main communication means of man, that is to say, speech. Other man-machine interfaces such as, for example, a keyboard by which electric pulses are generated for the machine are, on the other hand, adapted to the machine.
Since the speech signals travel wirelessly from the speaking person to the acoustic capturing device, for example a microphone of the speech recognition device, a speech control additionally provides an at least short-range remote control of apparatus without the need for further circuitry for this purpose. For example, in apparatus such as televisions, video recorders or other entertainment electronics devices in which remote control is customary nowadays, separate remote controls can be omitted if speech controls are used.

It may be assumed that equipping individual devices with isolated speech control systems, which understand words, sentences or commands for their field of application, is only a first stage in the development of automatic speech recognition systems for general living conditions. As a result of continuous, fast-moving technical development, a state in which the electronic devices, possibly including security systems, are combined at least in certain areas in a (possibly wireless) network and can generally be controlled and monitored by speech will probably be reached within a few years. The speech of the user is then recorded by one or more microphones and processed by a central computer unit in such a way that the user can speak to the various devices or functional units in the network as he likes. The user then interacts with a whole set of functional units or with a switching center for these functional units, which understands the user's language and ensures that the individual functional units or apparatus are controlled in accordance with the commands given. In such network systems the switching function or co-ordination of the apparatus can also be performed by a plurality of mutually networked speech recognition systems that collaborate in a suitable fashion, instead of by a single switching center with one speech recognition system. The whole complex control system with the speech recognition system or the various speech recognition systems, respectively, together with the connected apparatus or functional units, may be considered a kind of “environmental intelligence”.

The whole control system may be located in isolated form in the rooms of the user, for example in a living area or a certain office area. The system may, however, also be connected to external devices, for example to the Internet or to an intranet. In particular, certain parts of a speech recognition system, for example a highly efficient speech recognition module, may be installed on an external computer which is called as required via a data line, for example via the Internet or the intranet. The speech signal is then sent to the external computer and, subsequently, the recognized speech information is sent back to the respective system on site. Such large-scale network systems are advantageous per se because, as a rule, powerful speech recognition requires a correspondingly powerful computer, and appropriate networking allows a plurality of speech recognition or speech control systems to share a common external computer, so that this computer is better utilized.

In addition to the aforementioned advantages of such an “environment intelligence”, be it in the form of separate apparatus with separate speech recognition systems or in the form of a complex control system, there is, however, the disadvantage that the respective system always “listens in” on the user in order to extract commands to the system from the user's conversations. The problem is that, because of the complex networking of the individual speech recognition systems and apparatus, and because the components of the systems are usually installed as inconspicuously as possible for aesthetic reasons, the user cannot easily establish whether the speech recognition system (or, in the case of a plurality of speech recognition systems, which speech recognition system) is active, or to what extent the individual speech recognition systems are active.

The user faces this problem all the more when the speech recognition system is connected to an external area, or when the speech recognition system is located completely or partly in an external area which the user cannot fully control, and speech data of the user are passed from the “internal” user-controlled area, for example the living room or an office of the user, to an external area. Speech data are here understood to be either the captured speech signal itself, in its original or a changed form, or the speech information or parts thereof recognized from the speech signal. The speech information recognized from the speech signal may comprise not only words, word combinations, sentences or the like, but also information about the identity of the speaking person, which can be established from the characteristic biometric information contained in the speech signal. Similarly, the speech information may also contain information about the person's current frame of mind, which can be extracted from the speech signal from, for example, changes of voice, pitch, rapidity of speech etc.

Since it is not transparent to the user whether and in which form his utterances at a certain point of time are detected and analyzed, or stored and/or listened to, by a speech recognition system, situations may arise in which the user feels disturbed by the permanent listening of the speech recognition system or systems. This certainly holds for situations in which the user wishes to hold a confidential conversation. It holds particularly for extremely powerful speech recognition systems which are not only capable of understanding certain command words or command word combinations but can capture, analyze and process the user's continuous speech. It is then highly unpleasant for the user not to know whether his speech is recorded within the speech recognition system or is processed in another way, for example whether it is searched for certain keywords or sentences, or even whether statistics are compiled about negative remarks made on a certain theme. The user usually does not want this.

Therefore, it is an object of the invention to provide methods and speech recognition systems in which the user can better control to what extent utterances made by him are captured and processed by a speech recognition system.

This object is achieved, on the one hand, in that the speech recognition system, in so far as it is in an active receive mode, emits to the user an acknowledgement of receive activity in response to a user's enquiry about receive activity, in which the user queries whether the speech recognition system is in an active receive mode. The term active receive mode is used here for a state in which speech signals are captured and processed in some way by the system. A system is thus always in the active receive mode when the speech recognition system is, so to speak, “listening in”. In addition, there may be, for example, an operating mode in which the system is “ready to receive”. In such a mode the system is active only in that it waits for a certain command such as, for example, “speech recognizer on”, by which the speech recognition system can be switched on as required. The user thus has the possibility of addressing the speech recognition system by an arbitrary word, a sentence, a word combination or another defined acoustic signal, so that he hears from the speech recognition system itself whether it is listening. In particular, before he makes confidential remarks the user always has the possibility of being informed about the activity of the speech recognition system or systems.

With respect to the transmission of speech data to an external area not controlled by the user, the object is achieved in that, before speech data comprising at least a portion of the speech signal and/or at least a portion of the speech information recognized from the speech signal are transmitted from an internal user-controlled area into the external area, they are filtered and/or a message is sent to the user. In this way the user keeps control of his speech data before they reach the external area, or he is at least shown that such data are being transmitted to an external area, so that he can withhold confidential utterances which he would not like to reach the external area.

To implement the first method, the speech recognition system needs to have a signaling device for sending an acknowledgement of receive activity to the user, in order to indicate the active receive mode to the user in some way. Furthermore, the speech recognition system is to be designed such that, in the active receive mode, the enquiry about receive activity from the user can be recognized and, accordingly, the acknowledgement of receive activity is emitted via the signaling device. The signaling device may be a speech output device of the speech recognition system, for example a text-to-speech converter or an output with predefined, stored audio texts which are played back to the user. In this case the acknowledgement of receive activity takes the form of a respective speech signal to the user, for example a message “speech recognition system is active”.

To implement the second method, the speech recognition system, which comprises a component in the external area or is connected to the external area so that certain speech data are transmitted to the external area, is to have a suitable filter device which filters the speech data prior to their transmission to the external area. Alternatively or additionally, it is to comprise a signaling device to signal to the user beforehand when such a transmission of speech data to the external area will take place. This signaling device may also be a speech output device by which the speech recognition system emits a respective speech signal to the user via a loudspeaker.

Particularly when a plurality of speech recognition systems could be active, it is appropriate for the acknowledgement of receive activity to contain information from which the user learns which speech recognition system is concerned. In the case of a plurality of networked speech recognition systems, the acknowledgement of receive activity may then also be sent collectively for all the active speech recognition systems via a speech output device, for example via a message such as “speech recognition systems X, Y and Z are active”.
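The collective acknowledgement described above can be illustrated with a minimal Python sketch. The function name `build_acknowledgement` and the dictionary of activity states are assumptions made for illustration only; the specification prescribes neither a data structure nor the behavior when no system is active.

```python
def build_acknowledgement(states):
    """Compose one spoken message naming every active system.

    `states` maps a system name to True (active receive mode) or False.
    Returns None when nothing is active; in that case no acknowledgement
    is emitted (an assumption of this sketch).
    """
    active = [name for name, is_active in states.items() if is_active]
    if not active:
        return None
    if len(active) == 1:
        return f"speech recognition system {active[0]} is active"
    # Join as "X, Y and Z", matching the example message in the text.
    listing = ", ".join(active[:-1]) + " and " + active[-1]
    return f"speech recognition systems {listing} are active"
```

The resulting string would then be handed to the speech output device, for example a text-to-speech converter.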

To enhance the reliability of the method or of the speech recognition system, respectively, the emission of the acknowledgement of receive activity is preferably checked by the speech recognition system itself. In case of an erroneous acknowledgement of receive activity, and more particularly in case there is no acknowledgement of receive activity at all, the speech recognition system reacts in a predefined manner. Preferably the system deactivates itself. This measure prevents the user from wrongly concluding that no speech recognition system is in the active receive mode merely because he receives no acknowledgement of receive activity in response to an enquiry, for example because the speech recognition system contains an error or because the signaling unit has been intentionally manipulated. In so far as the acknowledgement of receive activity is a speech signal, the check can be made relatively simply: the speech recognition system detects the emission of its own speech signal with the same means with which the speech signals from the user are detected and, in the subsequent recognition or processing, recognizes this speech signal as its own acknowledgement of receive activity.

The user preferably always has the possibility, for example when he wishes to make a confidential utterance which should not be captured by a speech recognition system, of deactivating an active speech recognition system by means of a spoken command and of reactivating it as required. A method in which the user can temporarily deactivate the system for a certain period of time is preferred. After the predefined time period has elapsed, the speech recognition system autonomously switches on again.

Particularly with such an automatic switch-over from a deactivated mode to an active receive mode, it is advantageous for the speech recognition system to indicate the switch-on by itself. Such an activation message may be, for example, an optical or an acoustic message, for example again a speech signal. An acoustic message is advantageous in so far as the user can perceive it irrespective of his position and the direction in which he is looking.

Additionally, it is possible for the speech recognition system also to show optically whether it is in the active receive mode. Such a permanent optical message is possible because it usually does not disturb the user. However, it has the drawback that it cannot be recognized well from every position of the user, so that acoustic signaling should preferably be provided in addition to enhance the reliability in certain situations, for example in response to said enquiry about receive activity or in case of an automatic switch-on.

When a plurality of speech recognition systems are used, the user should preferably have the opportunity to address a specific speech recognition system and to deactivate and reactivate it. For example, it is certainly appropriate for the user to deactivate not the rudimentary speech recognition systems located in the internal area, which are only capable of recognizing certain command words for controlling certain devices, but all speech recognition systems which are capable of recognizing and processing continuous speech and/or via which the speech data could reach an external area.

The filtering of the speech data leaving for the external area may be effected automatically on the basis of predefined key speech data. The key speech data may be, for example, keywords, key sentences or whole sequences of key sentences. The speech data are compared with these key speech data during filtering and, depending on the match with the key speech data, it is then decided whether or not the speech data are transmitted to the external area. It is possible to predefine both key speech data which can be transmitted without any problem and key speech data which are certainly not to be transmitted.
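A minimal Python sketch of such automatic filtering follows. It assumes that the speech data are already available as recognized text and that key speech data are matched as lower-case substrings; the names `Decision` and `filter_speech_data`, and the fallback of asking the user when no key matches, are assumptions of this sketch and are not taken from the specification.

```python
from enum import Enum


class Decision(Enum):
    TRANSMIT = "transmit"   # matched key speech data that may always leave
    BLOCK = "block"         # matched key speech data that must never leave
    ASK_USER = "ask_user"   # no match: defer to the user (assumed default)


def filter_speech_data(text, allowed_keys, blocked_keys):
    """Compare outgoing speech data with predefined key speech data.

    Keys are expected in lower case. Blocked keys take precedence over
    allowed keys, so a single confidential phrase stops the transmission.
    """
    normalized = " ".join(text.lower().split())
    if any(key in normalized for key in blocked_keys):
        return Decision.BLOCK
    if any(key in normalized for key in allowed_keys):
        return Decision.TRANSMIT
    return Decision.ASK_USER
```

Giving blocked keys precedence is one possible design choice; it errs on the side of withholding data when both lists match.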

Another embodiment allows the user to filter the speech data manually himself. Such a manual selection, made after the system has indicated that a transmission of speech data is about to take place, may naturally also be effected in addition to an automatic filtering. For example, it is possible to store certain key speech data with which the outgoing speech data are compared; only when a comparison of the speech data provided for transmission with the key speech data shows a match is this indicated to the user, who then performs a manual post-filtering or check.

In a preferred embodiment, the second method, in which the speech data transmitted to an external area are filtered or the transmission is checked by the user, is combined with the first method, in which the user receives an acknowledgement of receive activity in response to an enquiry about receive activity. Such a combined speech recognition system, which comprises the two variants, offers the user full control of the speech signals uttered by him: depending on the degree of confidentiality of the utterances and, as required, on the control possibilities in the area used by him, the user can either totally deactivate the speech recognition system or simply control or prevent the transmission of his speech data to the external area.

The invention will be further explained in the following with reference to the appended drawing Figures by means of examples of embodiment, in which:

FIG. 1 gives a diagrammatic representation of a speech recognition system when an acknowledgement of receive activity is issued,

FIG. 2 gives a diagrammatic representation of a speech recognition system which has a component in an external area.

The example of embodiment shown in FIG. 1 concerns a relatively simple speech recognition system 1 which comprises a single system computer unit 6, for example a PC, on which a speech recognition software module 2 is implemented. This speech recognition module 2, shown only as a block 2, comprises not only the usual program portions with the speech recognition algorithms but also suitable libraries, rules of grammar etc. on the basis of which the recognition is performed. All necessary hardware components such as processor, memory etc. are made available by the computer unit 6.

A microphone 5 is connected to the computer unit 6 to capture the speech signals. The speech signals recorded by the microphone 5 are analyzed in the computer unit 6 by the speech recognition module 2.

Furthermore, the computer unit 6 includes as a speech output device a text-to-speech converter (TTS converter) 3 by which the speech recognition system generates speech signals for communication with a user (not shown). This TTS converter 3 is also a software module. The speech signals are output via a loudspeaker 4 connected to the computer unit 6.

The computer unit 6 further includes a control module 7 for driving a desired device or various devices in response to the recognized speech information and for driving the speech output device 3. The control of further devices (not shown) is performed via the data link 8. Similarly, the control module 7 may also drive the speech recognition module 2 and/or the microphone 5 or the microphone input, respectively, on the computer unit 6. The speech recognition system 1 may thus be activated or deactivated in this manner.

It is once more expressly stated that the speech recognition system 1 is only a very simple example and that the speech recognition system 1 may also be constructed in a more complex form. It may particularly comprise a plurality of different speech recognition modules which have, for example, different performance and/or are used for different applications. The speech recognition modules may then be used as required for controlling various apparatus or functional units, while it is also possible for certain apparatus to have certain speech recognition modules fixedly assigned to them. The speech recognition system may also include other speech recognition devices of a different type. Furthermore, the computer unit 6 may have a large variety of additional programs to react to speech commands from the user in a predefined manner, depending on the assignment, for example to control a certain connected device or system. The computer unit may also be a computer which is further used for other applications, for example a PC of the user. The speech recognition system may also comprise an arbitrary number of networked computer units over which the various tasks or software modules, respectively, are distributed.

So that the user can check at any time whether speech signals uttered by him are captured and processed by the speech recognition system 1, he has the possibility of making an enquiry about receive activity A to the speech recognition system 1. A typical enquiry A would be, for example, “speech recognizer active?”. In so far as the speech recognition system is in the active receive mode, that is, in a mode in which speech signals of the user are captured and processed, the microphone 5 automatically also captures this enquiry about receive activity A and the speech recognition module 2 analyzes it. At that point the enquiry A “speech recognizer active?” is recognized as speech information from the speech signal. The recognized enquiry A is then processed, for example, by the control module 7. This control module 7 is programmed in such a way that, in response to a recognized enquiry about receive activity A, a respective acknowledgement of receive activity B is issued by means of the TTS converter 3 via the loudspeaker 4, for example the sentence “speech recognizer is active”.
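The handling of enquiry A and acknowledgement B by the control module can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of the embodiment: the class name `ControlModule`, the injected `speak` callback standing in for the TTS converter and loudspeaker, and the exact-match comparison of the recognized text are all hypothetical.

```python
class ControlModule:
    """Answers an enquiry about receive activity while in active receive mode."""

    ENQUIRY = "speech recognizer active?"
    ACKNOWLEDGEMENT = "speech recognizer is active"

    def __init__(self, speak):
        # `speak` is an assumed callback that outputs text as speech,
        # standing in for the TTS converter and the loudspeaker.
        self.speak = speak
        self.active = True  # True means active receive mode

    def on_recognized(self, text):
        """Process recognized speech information; return True if answered."""
        if not self.active:
            # Outside the active receive mode nothing is processed.
            return False
        if text.strip().lower() == self.ENQUIRY:
            self.speak(self.ACKNOWLEDGEMENT)
            return True
        return False
```

Because the enquiry travels through the same capture and recognition path as any other utterance, an answer implies that this path is in fact active.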

As a result of a failure of, or a manipulation on, the computer unit 6, the data lines or further devices of the system 1, for example as a result of an interruption of the line from the output of the TTS converter 3 to the loudspeaker 4, it may happen that no acknowledgement of receive activity B is issued to the user although the user has sent an enquiry about receive activity A to the speech recognition system 1 and the speech recognition system 1 is in the active receive mode. The user would then unjustifiably feel “safe”. Therefore, the control module 7 is programmed such that a check is made whether the acknowledgement of receive activity B issued by the loudspeaker 4 is captured again by the microphone 5 of the speech recognition system 1 and recognized by its own speech recognition module 2. In so far as the speech recognition system does not record the acknowledgement of receive activity B via its own input channel within a given time period after this acknowledgement has been issued, the control module 7 will deactivate the speech recognition module 2, at least to the extent that the speech recognition system 1 remains able to react only to a certain command such as “speech recognition system on”.
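The loop-back check described above can be sketched in Python as follows. The class name `SelfCheckingRecognizer`, the `speak` and `listen` callbacks, the two-second default and the injectable `clock` are assumptions introduced for illustration; the specification only requires that the system's own acknowledgement be recaptured within a given time period.

```python
import time


class SelfCheckingRecognizer:
    """Emits acknowledgement B and verifies it via the system's own input."""

    def __init__(self, speak, listen, clock=time.monotonic):
        self.speak = speak    # assumed: outputs text through the loudspeaker
        self.listen = listen  # assumed: returns recognized text or None,
                              # blocking briefly while waiting for input
        self.clock = clock
        self.active = True

    def acknowledge(self, ack_text="speech recognizer is active", timeout_s=2.0):
        """Return True if the acknowledgement was heard back in time."""
        self.speak(ack_text)
        deadline = self.clock() + timeout_s
        while self.clock() < deadline:
            heard = self.listen()
            if heard is not None and heard.strip().lower() == ack_text.lower():
                return True
        # Nothing heard back: the output path may be broken or manipulated,
        # so the recognizer deactivates itself.
        self.active = False
        return False
```

In a fuller system a deactivated recognizer would still react to a wake command such as “speech recognition system on”; that part is omitted here.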

In a more complex speech recognition system which comprises a plurality of speech recognition modules for various apparatus, which may be activated and deactivated separately within the speech recognition system, it is highly suitable for the acknowledgement of receive activity also to indicate to what extent the speech recognition system 1 is active, for example by issuing “speech recognition for TV and for video recorder are switched on”.

Similarly, for a parallel use of various speech recognition systems 1 it is suitable for the acknowledgement of receive activity B to contain information about which speech recognition system is answering, for example via a message “speech recognition system for kitchen area is active”. Conversely, the enquiry about receive activity A may be directed not only to certain systems but also, globally, to all speech recognition systems: the user makes, for example, a specific enquiry such as “speech recognition system for Internet and telecommunication active?” or a general enquiry such as “any speech recognizer active?”. This is especially suitable if, for example, only certain systems have a link to an external area and/or are in a position to understand continuous speech, whereas other speech recognition systems understand only a limited number of command words.

In case of a deactivation, the respective speech command may also be given globally to a plurality of speech recognition systems. For example, a command “all speech recognizers down for five minutes” can be received and processed by all the speech recognition systems which are in the active receive mode at this point of time. The command may, however, also be given to individual speech recognition systems or individual speech recognition modules in a speech recognition system, which are specifically mentioned by the user.
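A deactivation command with a time parameter, such as the example above, can be sketched in Python as follows. The names `parse_pause_seconds` and `PausableRecognizer`, the small number-word table and the regular expression are assumptions for illustration; a real system would obtain the duration from its own recognition grammar.

```python
import re
import time

# Small illustrative table of number words; digits are accepted as well.
_NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "ten": 10}
_UNIT_SECONDS = {"second": 1, "minute": 60, "hour": 3600}


def parse_pause_seconds(command):
    """Extract the pause length in seconds from a spoken command, or None."""
    match = re.search(r"for (\w+) (second|minute|hour)s?\b", command.lower())
    if match is None:
        return None
    amount, unit = match.group(1), match.group(2)
    value = int(amount) if amount.isdigit() else _NUMBER_WORDS.get(amount)
    if value is None:
        return None
    return value * _UNIT_SECONDS[unit]


class PausableRecognizer:
    """Becomes active again by itself once the requested pause has elapsed."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._resume_at = 0.0

    def pause(self, seconds):
        self._resume_at = self._clock() + seconds

    def is_active(self):
        return self._clock() >= self._resume_at
```

A global command would simply be applied to every recognizer that was in the active receive mode when it was spoken.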

FIG. 2 shows a simple example of a speech recognition system 10 which has a similar structure to the speech recognition system 1 shown in FIG. 1. This speech recognition system 10 also includes a computer unit 7 with a speech recognition module 11, a TTS converter 12 and a control module 13. Similarly, a microphone 8 for capturing speech signals from the user and a loudspeaker 9 for issuing speech signals from the TTS converter 12 are connected to the computer unit 7. This computer unit 7 of the speech recognition system 10 is situated in an internal area I which can be very well controlled by the user; for example, it is a PC in the user's office.

A further component of the system 10 is located, on the other hand, on a central server 15 in an external area E, for example on a server 15 of a company intranet. For certain actions, speech data S, i.e. the user's speech signals recorded by the microphone 8 or speech information recognized from the speech signals by means of the speech recognition module 11, are transmitted to the external server 15, and thus to the intranet, via the link from the computer unit 7 of the speech recognition system 10 to the external server 15. The user himself then usually has no control over what happens to these speech data and in what form they are used, stored and processed beyond their intended application. Therefore, the speech recognition system 10 according to the invention offers the user the possibility of checking the transmission of these speech data S to the external area E.

In the example of embodiment actually shown, it is speech information already recognized by the speech recognition module 11 that is transmitted to the server 15, for example in order to surf said intranet via the computer unit 7. This means that in this case it is not the speech signal of the user itself but the speech information recognized from the speech signal that is transmitted to the server 15.

To avoid speech data S being transmitted to the external area E in a way undetected by the user, the outgoing speech data S are filtered in a filter 14 which is located in the computer unit 7 in the internal area I. The filter 14 is here also a software module with an associated memory area in which keywords or word combinations are stored which can be freely selected by the user. These are, for example, keywords or word combinations for which the user wishes to receive a warning first whenever speech data S containing them are to be transmitted to the external area E. Therefore, all outgoing speech data S are first compared with the keywords or keyword combinations. In so far as speech data S contain these keywords or keyword combinations, the control module 13 causes the TTS converter 12 to issue a warning to the user through the loudspeaker 9.

This warning contains, for example, a reproduction of the speech data S which are on the point of being output. The user is then requested to give an acknowledgement for the transmission, i.e. the speech recognition system 10 once more asks the user whether it is allowed to transmit the particular speech data S to the external area E.
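The warn-and-confirm step performed before transmission can be sketched in Python as follows. The function name `transmit_with_confirmation` and the `warn`, `confirm` and `send` callbacks are hypothetical stand-ins for the TTS warning, the user's spoken answer and the link to the external server; the substring matching on recognized text is likewise an assumption.

```python
def transmit_with_confirmation(speech_data, watch_keywords, warn, confirm, send):
    """Send speech data to the external area, asking the user first
    whenever the data contain one of the user's watch keywords.

    Returns True if the data were sent, False if the user refused.
    """
    text = speech_data.lower()
    hits = [key for key in watch_keywords if key.lower() in text]
    if hits:
        # Reproduce the data that are about to leave the internal area.
        warn(f"About to transmit to the external area: {speech_data}")
        if not confirm():
            return False
    send(speech_data)
    return True
```

Data without any watch keyword pass through unprompted in this sketch, which keeps routine commands from triggering a warning every time.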

The invention is intended to ensure that persons who use speech recognition technologies in their daily lives may rest assured that these techniques are not misused to intrude on their privacy. The methods and speech recognition systems proposed consequently ensure that the ergonomic advantages of an “environment intelligence” that understands speech cannot be extended into a control system controlling the user. The user may thus enjoy the advantages of the speech recognition systems and yet keep full control of them, particularly through the knowledge of which speech recognition systems are active and to what extent data leave a certain field of privacy.

1. A method of operating a speech recognition system comprising the act of: analyzing, by a processor, a speech signal from a user for recognizing speech information contained in the speech signal; emitting an acknowledgment of a receive activity to the user in response to an inquiry from the user as to whether the speech recognition system is in an active receive mode where speech signals are captured and processed; wherein the acknowledgment of the receive activity comprises a speech output from a speech output device; wherein the speech output comprises issuing a specific message acknowledging that the speech recognizer is active; controlling output of the acknowledgment of the receive activity and, in case of an erroneous output of the acknowledgment of the receive activity, the speech recognition system deactivates itself.

2. The method as claimed in claim 1, wherein the acknowledgement of the receive activity comprises information for identification of a specific one or more speech recognition systems out of a plurality of speech recognition systems which is in the active receive mode.

3. The method as claimed in claim 1, further comprising the act of recognizing output of the acknowledgement of the receive activity by acoustically detecting output of the acknowledgement as a speech signal from the speech output device.

4. The method as claimed in claim 1, further comprising the act of temporarily deactivating the speech recognition system by a deactivation command from the user, the deactivation command containing a time parameter which predefines for how long the speech recognition system is deactivated.

5. The method as claimed in claim 1, further comprising the act of showing by the speech recognition system when it is switched over to an active receive mode.

6. A speech recognition system comprising: means for capturing a speech signal from a user; means for analyzing the speech signal for recognizing speech information contained in the speech signal; a signaling device for sending an acknowledgment of receive activity to the user to indicate that the speech recognition system is in an active receive mode in which speech signals are captured and processed; wherein the speech recognition system is arranged so that, while being in the active receive mode, the speech recognition system recognizes an inquiry about receive activity from a user by which the user queries whether the speech recognition system is in the active receive mode, and subsequently sends the acknowledgment of the activity; wherein the acknowledgment of the receive activity comprises a speech output from a speech output device; wherein the speech output comprises issuing a specific message acknowledging that the speech recognizer is active; means for controlling output of the acknowledgment of the receive activity and, in case of an erroneous output of the acknowledgment of the receive activity, the speech recognition system deactivates itself.

7. A non-transitory computer readable medium embodying a computer program, the computer program when executed by a processor is configured to operate a speech recognition system including performing the act of: analyzing a speech signal from a user for recognizing speech information contained in the speech signal; emitting an acknowledgment of a receive activity to the user in response to the user in response to an inquiry from the user as to whether the speech recognition system is in active receive mode where speech signals are captured and processed; wherein the acknowledgment of the receive activity comprises a speech output from a speech output device; wherein the speech output comprises issuing a specific message acknowledging that the speech recognizer is active; controlling output of the acknowledgment of the receive activity and, in case of an erroneous output of the acknowledgment of the receive activity, the speech recognition system deactivates itself.

8. A speech recognition system comprising: a processor; a speech recognizing module configured to recognize an inquiry from a user asking whether the speech recognition system is active; an output device configured to provide a response to the user responsive to the inquiry; a controller configured to deactivate the speech recognition system if the response is not provided; wherein the response of the receive activity comprises a speech output from the output device; wherein the speech output comprises issuing a specific message acknowledging that the speech recognizer is active; wherein the speech recognition system deactivates itself, in case of an erroneous output of the response of the receive activity.